Introduction to Open Data Science - Course Project

About the project

Ruminations about the Introduction to Open Data Science (IODS) 2021

Feelings

The course Introduction to Open Data Science (IODS) feels like a good introduction to the R-Rstudio-git pipeline advocating for reproducible well-documented research.

Challenges may arise due to

  • various backgrounds of the students
  • technological hurdles

Outcomes

I hope the course will generate mindsets favoring open research equipped with technical skills to produce good research output.

I hope to learn not only technical skills but also ways to endorse open data science.

Where to find the course

I found the course in Sisu. Apparently it is also found in the MOOC platform.

Repo

A fork of the course can be found at my repo.


Week 2 - Simple linear regression

date()
## [1] "Sat Nov 20 12:56:39 2021"

Introduction

This week, we are analyzing and modeling a survey data concerning the approaches to learning. Description of the data can be found here and here.

The data has been preprocessed to include variables on gender, age, attitude, exam points, and scores on deep, surface and strategic questions.

Dataset

Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.

# read in the data
learning2014 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")

dim(learning2014)
## [1] 166   7

There are 166 rows and 7 columns.

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The variables are mostly numeric. Gender is of character value, change it to factor.

# change to factor
learning2014$gender <- as.factor(learning2014$gender)

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Seems like most of the students are female, age spans 17-55, exam points are between 7-33 and the question variables are capped at 5 (from documentation, on a scale 1-5).

Graphical overview

Using GGlally::ggpairs, we can plot the distribution of the variables, correlation (with significance), boxplots and scatterplots stratified by the gender.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)),  
             upper = list(continuous = wrap("cor", size = 3)))

p

We see that

  • Most students are female
  • Age is right skewed (more of smaller age)
  • There is a positive correlation between points and attitude
  • There is, as expected, a negative correlation between surf and deep questions
  • There is a negative correlation between surf and attitude
  • There is a negative correlation between surf and stra questions
  • Distribution of the exam points is not perfectly normal (can be tested with e.g. stats::shapiro.test)

Linear regression

For pedagogical reasons, let’s fit a simple multivariable linear regression first with four (instead of three; choosing three would not change the results) explanatory variables, ie. regress exam points on other variables. Choose variables based on their absolute correlation with the exam points.

sort(abs(cor(learning2014[, -1])[, c("points")]))
##       deep        age       surf       stra   attitude     points 
## 0.01014849 0.09319032 0.14435642 0.14612247 0.43652453 1.00000000
model <- lm(points ~ attitude + stra + surf + age, data = learning2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8639  -3.2849   0.3086   3.4548  10.7028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.70644    3.96234   3.459 0.000694 ***
## attitude     3.38964    0.57041   5.942 1.68e-08 ***
## stra         0.93123    0.53986   1.725 0.086459 .  
## surf        -0.76565    0.80258  -0.954 0.341524    
## age         -0.09466    0.05346  -1.771 0.078502 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.262 on 161 degrees of freedom
## Multiple R-squared:  0.2226, Adjusted R-squared:  0.2033 
## F-statistic: 11.52 on 4 and 161 DF,  p-value: 2.987e-08

From the summary, the intercept is very significantly non-zero based on a t-test (H_0: intercept is zero, H_1: intercept is not zero) although the result is not very meaningful as e.g. age cannot be zero in this context.

The attitude variable has a very significant effect on the exam points (p-value about zero for a t-test for H_0: attitude has no effect on the slope when the other variables are kept constant, H_1: attitude has an effect). Variables stra and age are weakly significant or non-significant depending on the nomenclature (same test). Variable surf is not significant, let’s remove it and refit.

model <- lm(points ~ attitude + stra + age, data = learning2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

Now all explanatory variables are significant with a 0.10 significance cutoff.

The F-test is significant, some of the variables are therefore associated with the response variable (have non-zero coefficient).

Interpretation of the linear model

From the summary above, we see that as attitude increases by one, exam points increase by about 3.48 when other variables are kept constant. For stra, the value is about 1.00. Age seems to have a smaller effect, with an additional year decreasing exam points by around 0.09 if other variables do not change.

Multiple R-squared is 0.2182, meaning that the three explanatory variables account for approximately 22% variation in the exam points. This seems quite low and we should be careful when e.g. predicting exam points for new students with this model. The adjusted R-squared considers the number of explanatory variables and has here a similar, a bit smaller, value to the unadjusted R-squared.

Diagnostics

Linear models carry assumptions that need to be verified. We use diagnostic plots for visual assessment.

Residuals versus fitted

# residuals vs fitted values
plot(model, which = 1)

There doesn’t seem to be a clear pattern (maybe visually somewhat of a funnel with decreasing variance but Breusch-Pagan test car::ncvTest(model) doesn’t detect heteroskedasticity) in the residuals versus fitted values in the sense that the variance in residuals is similar across the fitted values. Also, the variation is approximately symmetrical around zero. Constant variance assumption appears to be met. Linear model appears to be OK for the data.

Three outliers (set by id.n in plot.lm) are detected (rows 35, 56 and 145), let’s check them.

print(learning2014[c(35, 56, 145), ])
##     gender age attitude     deep  stra     surf points
## 35       M  20      4.2 4.500000 3.250 1.583333     10
## 56       F  45      3.8 3.000000 3.125 3.250000      9
## 145      F  21      3.5 3.916667 3.875 3.500000      7
annotate <- c(35, 56, 145, 4, 2)  # 4 and 2 for leverage below

# make the plot tighter
op <- par(mfrow=c(3, 1),
          mar = c(4, 4, 1, 1),
          mgp = c(2, 1, 0))

parameters <- c("attitude", "stra", "age")

for (param in parameters) {
  print(param)
  plot(learning2014[, param], learning2014$points, 
       xlab = param, ylab = "points")
  
  text(learning2014[annotate, param], 
       learning2014[annotate, "points"], 
       annotate, pos = 4, col = "red")
}
## [1] "attitude"
## [1] "stra"
## [1] "age"

par(op) # return original par

It appears that here are the students that got low exam points despite having a high attitude value.

Plot boxplot for assessment of symmetry of the residual distribution.

boxplot(resid(model))

Does not seem too bad, but is not perfectly symmetric. QQ-plot will aid in deciding normality.

QQ-plot

# qq-plot
plot(model, which = 2)

Standardized residuals follow the linear pattern quite reasonably with the same outlier exceptions as above. Hence, there is no strong reason to suspect deviation from normality in the distribution of the residuals.

Residuals versus leverage

# residuals vs. leverage
plot(model, which = 5)

Observations 2, 4 and 56 have been marked as outliers, they have relatively high influence of the regression line compared to other observations. We can rerun the model without these to see if the multiple R-squared is increased.

Cook’s distance is also low, supporting the notion that there are no outliers having a great effect on the linear fit.

max(cooks.distance(model))
## [1] 0.1202162
summary(lm(points ~ attitude + stra + age, data = learning2014[-c(2, 4, 56), ]))
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014[-c(2, 
##     4, 56), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.822  -3.520   0.348   3.617  10.919 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.38472    2.58566   3.243  0.00144 ** 
## attitude     3.64594    0.53695   6.790 2.11e-10 ***
## stra         0.84219    0.51066   1.649  0.10108    
## age          0.01968    0.05677   0.347  0.72930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.008 on 159 degrees of freedom
## Multiple R-squared:  0.2419, Adjusted R-squared:  0.2276 
## F-statistic: 16.91 on 3 and 159 DF,  p-value: 1.394e-09

Multiple R-square is indeed a bit better. However, removal of the outliers would have to be justified from the data (e.g. are these students somehow different).

VIF

library(car)
## Loading required package: carData
vif(model)
## attitude     stra      age 
## 1.004083 1.014233 1.010865

Variance inflation factors are less than 10 (textbook), so there is no concern for collinearity.

AIC

We can try automatic stepwise model selection by AIC criterion.

library(MASS)
model.full <- lm(points ~ gender + age + attitude + deep + stra + surf, 
                 data = learning2014)
stepAIC(model.full, direction = "both")
## Start:  AIC=558.36
## points ~ gender + age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - gender    1      0.11 4408.5 556.36
## - surf      1     47.79 4456.1 558.15
## - deep      1     49.00 4457.3 558.19
## <none>                  4408.3 558.36
## - stra      1     83.78 4492.1 559.48
## - age       1     87.88 4496.2 559.63
## - attitude  1    919.82 5328.2 587.82
## 
## Step:  AIC=556.36
## points ~ age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     47.69 4456.1 556.15
## - deep      1     49.11 4457.6 556.20
## <none>                  4408.5 556.36
## - stra      1     88.35 4496.8 557.66
## - age       1     90.16 4498.6 557.72
## + gender    1      0.11 4408.3 558.36
## - attitude  1    999.18 5407.6 588.27
## 
## Step:  AIC=556.15
## points ~ age + attitude + deep + stra
## 
##            Df Sum of Sq    RSS    AIC
## - deep      1     26.61 4482.8 555.14
## <none>                  4456.1 556.15
## + surf      1     47.69 4408.5 556.36
## - age       1     75.36 4531.5 556.93
## - stra      1    106.07 4562.2 558.05
## + gender    1      0.02 4456.1 558.15
## - attitude  1   1084.38 5540.5 590.30
## 
## Step:  AIC=555.14
## points ~ age + attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4482.8 555.14
## - age       1     76.62 4559.4 555.95
## + deep      1     26.61 4456.1 556.15
## + surf      1     25.20 4457.6 556.20
## - stra      1     97.64 4580.4 556.71
## + gender    1      0.01 4482.8 557.14
## - attitude  1   1060.72 5543.5 588.39
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Coefficients:
## (Intercept)          age     attitude         stra  
##    10.89543     -0.08822      3.48077      1.00371

The produced model is the same we derived earlier.

Brute force

Finally, we can try to brute force all linear models of simple combination of explanatory variables.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
res <- olsrr::ols_step_all_possible(model.full)

res[order(res$adjr, decreasing = TRUE), ]
##    Index N                         Predictors     R-Square Adj. R-Square
## 62    57 5        age attitude deep stra surf 0.2311318189   0.207104688
## 33    22 3                  age attitude stra 0.2181719645   0.203693668
## 52    42 4             age attitude deep stra 0.2228135019   0.203504521
## 54    43 4             age attitude stra surf 0.2225665343   0.203251417
## 63    63 6 gender age attitude deep stra surf 0.2311510113   0.202137842
## 43    44 4           gender age attitude stra 0.2181729919   0.198748718
## 57    58 5      gender age attitude deep stra 0.2228168858   0.198529914
## 59    59 5      gender age attitude stra surf 0.2226046106   0.198311005
## 53    45 4             age attitude deep surf 0.2157221484   0.196236984
## 56    46 4            attitude deep stra surf 0.2154077956   0.195914822
## 17     7 2                      attitude stra 0.2048096617   0.195052725
## 38    23 3                 attitude deep stra 0.2096692889   0.195033535
## 34    24 3                  age attitude surf 0.2081990315   0.193536051
## 40    25 3                 attitude stra surf 0.2074263400   0.192749050
## 58    60 5      gender age attitude deep surf 0.2165385715   0.192055402
## 12     8 2                       age attitude 0.2011434849   0.191341564
## 61    61 5     gender attitude deep stra surf 0.2158241415   0.191318646
## 27    26 3               gender attitude stra 0.2050965812   0.190376147
## 48    47 4          gender attitude deep stra 0.2098633360   0.190232611
## 32    27 3                  age attitude deep 0.2043132366   0.189578297
## 44    48 4           gender age attitude surf 0.2090640320   0.189413449
## 50    49 4          gender attitude stra surf 0.2079025033   0.188223062
## 39    28 3                 attitude deep surf 0.2023788945   0.187608133
## 22    29 3                gender age attitude 0.2017804149   0.186998571
## 3      1 1                           attitude 0.1905536672   0.185618019
## 18     9 2                      attitude surf 0.1952865878   0.185412804
## 42    50 4           gender age attitude deep 0.2048827672   0.185128302
## 49    51 4          gender attitude deep surf 0.2040700385   0.184295381
## 16    10 2                      attitude deep 0.1939911024   0.184101423
## 28    30 3               gender attitude surf 0.1970278452   0.182157990
## 8     11 2                    gender attitude 0.1919357660   0.182020867
## 26    31 3               gender attitude deep 0.1952588802   0.180356267
## 47    52 4               gender age stra surf 0.0653293394   0.042107708
## 60    62 5          gender age deep stra surf 0.0707276010   0.041687839
## 37    32 3                      age stra surf 0.0520482669   0.034493605
## 55    53 4                 age deep stra surf 0.0568671481   0.033435276
## 24    33 3                    gender age stra 0.0504456777   0.032861338
## 31    34 3                   gender stra surf 0.0462022812   0.028539360
## 45    54 4               gender age deep stra 0.0514840047   0.027918390
## 51    55 4              gender deep stra surf 0.0510011757   0.027423565
## 21    12 2                          stra surf 0.0363412029   0.024517169
## 25    35 3                    gender age surf 0.0420448279   0.024304917
## 41    36 3                     deep stra surf 0.0407134651   0.022948900
## 10    13 2                        gender stra 0.0346692266   0.022824677
## 46    56 4               gender age deep surf 0.0462679619   0.022572756
## 15    14 2                           age surf 0.0340077514   0.022155086
## 14    15 2                           age stra 0.0331743748   0.021311484
## 36    37 3                      age deep surf 0.0379369858   0.020121004
## 29    38 3                   gender deep stra 0.0357520229   0.017895579
## 35    39 3                      age deep stra 0.0336895603   0.015794923
## 5      2 1                               stra 0.0213517768   0.015384410
## 6      3 1                               surf 0.0208387755   0.014868280
## 11    16 2                        gender surf 0.0267878521   0.014846599
## 30    40 3                   gender deep surf 0.0306219089   0.012670463
## 20    17 2                          deep surf 0.0244545065   0.012484623
## 19    18 2                          deep stra 0.0219453518   0.009944681
## 7     19 2                         gender age 0.0196556536   0.007626889
## 2      4 1                                age 0.0086844365   0.002639829
## 1      5 1                             gender 0.0086318623   0.002586935
## 23    41 3                    gender age deep 0.0198419947   0.001690921
## 9     20 2                        gender deep 0.0088743608  -0.003286690
## 13    21 2                           age deep 0.0087454938  -0.003417138
## 4      6 1                               deep 0.0001029919  -0.005993941
##    Mallow's Cp
## 62    5.003969
## 33    3.684101
## 52    4.724219
## 54    4.775293
## 63    7.000000
## 43    5.683889
## 57    6.723519
## 59    6.767418
## 53    6.190730
## 56    6.255739
## 17    4.447461
## 38    5.442477
## 34    5.746530
## 40    5.906325
## 58    8.021891
## 12    5.205636
## 61    8.169637
## 27    6.388125
## 48    7.402347
## 32    6.550123
## 44    7.567646
## 50    7.807853
## 39    6.950150
## 22    7.073917
## 3     5.395638
## 18    6.416857
## 42    8.432342
## 49    8.600417
## 16    6.684767
## 28    8.056761
## 8     7.109816
## 26    8.422587
## 47   37.292359
## 60   38.175985
## 37   38.038920
## 55   39.042363
## 24   38.370340
## 31   39.247886
## 45   40.155611
## 51   40.255461
## 21   39.287183
## 25   40.107658
## 41   40.382987
## 10   39.632952
## 46   41.234303
## 15   39.769746
## 14   39.942091
## 36   40.957170
## 29   41.409026
## 35   41.835549
## 5    40.387035
## 6    40.493125
## 11   41.262841
## 30   42.469948
## 20   41.745383
## 19   42.264283
## 7    42.737798
## 2    43.006675
## 1    43.017547
## 23   44.699262
## 9    44.967398
## 13   44.994048
## 4    44.781340

By adjusted R-Square, it appears that the three variable model we used before fares very well compared to the full model. Attitude alone is also quite good in comparison and could be chosen by parsimony.


Week 3 - Logistic regression

date()
## [1] "Sat Nov 20 12:56:50 2021"

Introduction

This week, we are analyzing and modeling data on two questionnaires related to student performance and alcohol consumption. Further description is available here. Since the description is relevant for analysis we reprint it verbatim:

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The dataset has been preprocessed by joining data from two courses, Math and Portugese language. Variables alc_use and high_use have been added in the wrangling phase.

Dataset

Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.

rm(list=ls())

# read in the data, convert strings to factors
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv",
                header = TRUE, sep = ",", stringsAsFactors = TRUE)

dim(alc)
## [1] 370  51

There are 370 rows and 51 columns.

str(alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

Seems like there are mostly integer and character (factorized) valued variables, as well as at least one logical (high_use) and float (alc_use).

# (Type for factor is int.)
table(sapply(alc, typeof))
## 
##  double integer logical 
##       1      49       1
# Check for zero variance (negate characters to get "not all duplicated")
sort(
  sapply(alc, function(x) { 
    ifelse(is.numeric(x), var(x), !all(duplicated(x)[-1L])) 
    }))
##            n   failures.p     failures   traveltime   failures.m    studytime 
## 0.000000e+00 2.398667e-01 3.109939e-01 4.916502e-01 5.049513e-01 7.189922e-01 
##         Dalc       famrel     freetime      alc_use       school          sex 
## 8.032594e-01 8.304695e-01 9.712224e-01 9.958178e-01 1.000000e+00 1.000000e+00 
##      address      famsize      Pstatus         Mjob         Fjob       reason 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##     guardian    schoolsup       famsup   activities      nursery       higher 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##     internet     romantic         paid       paid.p       paid.m     high_use 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##         Medu         Fedu        goout          age         Walc       health 
## 1.173984e+00 1.179697e+00 1.273720e+00 1.393987e+00 1.666366e+00 1.981220e+00 
##         G2.p         G1.p           G1           G2         G3.p           G3 
## 6.078518e+00 6.512854e+00 7.301699e+00 7.914627e+00 8.665092e+00 1.080868e+01 
##         G1.m         G2.m         G3.m   absences.p     absences   absences.m 
## 1.119153e+01 1.444749e+01 2.124131e+01 2.330626e+01 3.029392e+01 5.876224e+01 
##          cid         id.m         id.p 
## 1.143917e+04 1.274368e+04 3.041907e+04
# Seems OK, only `n` has zero variance.

Variables correlated to alcohol consumption

Let’s apply a data-driven approach to discover the top 4 variables linked to the alcohol consumption. Intuitively, family relationships and characteristics, study time, grades, and peer behavior should have an effect.

We want to compute association between all the variables. To allow for factors, we employ a stackoverflow solution

# From 
# https://stackoverflow.com/questions/52554336/plot-the-equivalent-of-correlation-matrix-for-factors-categorical-data-and-mi/56485520#56485520

library(tidyverse)
library(rcompanion)


# Calculate a pairwise association between all variables in a data-frame. In particular nominal vs nominal with Chi-square, numeric vs numeric with Pearson correlation, and nominal vs numeric with ANOVA.
# Adopted from https://stackoverflow.com/a/52557631/590437
mixed_assoc = function(df, cor_method="spearman", adjust_cramersv_bias=TRUE){
    df_comb = expand.grid(names(df), names(df),  stringsAsFactors = F) %>% set_names("X1", "X2")

    is_nominal = function(x) class(x) %in% c("factor", "character")
    # https://community.rstudio.com/t/why-is-purr-is-numeric-deprecated/3559
    # https://github.com/r-lib/rlang/issues/781
    is_numeric <- function(x) { is.integer(x) || is_double(x)}

    f = function(xName,yName) {
        x =  pull(df, xName)
        y =  pull(df, yName)

        result = if(is_nominal(x) && is_nominal(y)){
            # use bias corrected cramersV as described in https://rdrr.io/cran/rcompanion/man/cramerV.html
            cv = cramerV(as.character(x), as.character(y), bias.correct = adjust_cramersv_bias)
            data.frame(xName, yName, assoc=cv, type="cramersV")

        }else if(is_numeric(x) && is_numeric(y)){
            correlation = cor(x, y, method=cor_method, use="complete.obs")
            data.frame(xName, yName, assoc=correlation, type="correlation")

        }else if(is_numeric(x) && is_nominal(y)){
            # from https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable/124618#124618
            r_squared = summary(lm(x ~ y))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else if(is_nominal(x) && is_numeric(y)){
            r_squared = summary(lm(y ~x))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else {
            warning(paste("unmatched column type combination: ", class(x), class(y)))
        }

        # finally add complete obs number and ratio to table
        result %>% mutate(complete_obs_pairs=sum(!is.na(x) & !is.na(y)), complete_obs_ratio=complete_obs_pairs/length(x)) %>% rename(x=xName, y=yName)
    }

    # apply function to each variable combination
    map2_df(df_comb$X1, df_comb$X2, f)
}

Then we can use the function for correlations

# remove boolean high_use (or recode it to integer/factor)
mixed.cors <- mixed_assoc(alc[, -which(names(alc) %in% c("high_use"))])
library(corrplot)

cor.matrix <- 
  mixed.cors %>% 
  select(x, y, assoc) %>% 
  spread(y, assoc) %>% 
  select(-x) %>%
  replace(is.na(.), 0)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "color",
         tl.cex = 0.5, tl.col = "black")

From the correlation plot, we see that, as expected by definition, Dalc and Walc correlate positively with alc_use. It would be interesting to include these for further analysis to explain the drinking behavior. Since we’re limited to four variables, we choose a coarse approach and only take alc_use.

The grade variables correlate negatively with alc_use; based on the data annotation in the beginning, it is easiest to choose G3 for further analysis. The famrel correlates negatively with alc_use, and seems quite interesting. goout, sex and studytime are interesting as well. Let’s make sure these variables don’t correlate too much.

interesting_vars <- c("alc_use", "G3", "famrel", "goout", "sex", "studytime")

cor.matrix <- 
  alc %>%
    select(interesting_vars) %>%
    mixed_assoc(cor_method = "pearson") %>%
    select(x, y, assoc) %>% 
    spread(y, assoc) %>%
    select(-x)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "color",
         tl.cex = 1, tl.col = "black", addCoef.col = 'black')

Seems good!

# Select the subset
alc.sub <- alc[, interesting_vars]

str(alc.sub)
## 'data.frame':    370 obs. of  6 variables:
##  $ alc_use  : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ G3       : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ famrel   : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ goout    : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ sex      : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ studytime: int  4 2 1 3 3 3 3 2 4 4 ...

Further exploration

Begin with the GGally::ggpairs

library(GGally)
library(ggplot2)

p <- ggpairs(alc.sub, mapping = aes(col = sex, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)),  
             upper = list(continuous = wrap("cor", size = 3)))

p

We see that

  • There is approximately equal number of males (n=175) and females (n=195)
  • Males tend to use more alcohol
  • Distribution of grade G3 is similar for both sexes, and is not normal (by shapiro.test()), has a peak and is left skewed
  • famrel correlates with alc_use, but significantly only for males
  • Males report better quality family relationships
  • Males use less time for studying
  • goout correlates positively with alcohol use with a high significance
  • studytime correlates negatively with alcohol use with a high significance

All in all, there are several hypotheses

  • Getting a better grade while being male is associated with lower alcohol use
  • Family relationships affect alcohol use
  • Alcohol is used when going out
  • Alcohol is not used when studying

Let’s take a closer look on the quality of family relationship versus alcohol use

library(tidyverse)

alc_var <- function(alc, expl.variable) {
  expl.vars <- c(expl.variable, "alc_use", "sex")
  
  # Tidyverse black magic
  res <- 
    alc %>%
      select(one_of(expl.vars)) %>%
      count(!!!sapply(expl.vars, as.symbol))
  
  g <- ggplot(res, aes(x = .data[[expl.variable]], y = alc_use, col = sex)) + 
    geom_point(aes(size = n), alpha = 0.5) + geom_smooth(method = "loess")
  
  return(g)
}

alc_var(alc, "famrel")

With a bit of caution, it appears than men with low quality famrel use more alcohol, as there is an increase in alc_use when famrel goes from medium to bad. However, the relationship is roughly linear (without deep analysis):

summary(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## 
## Call:
## lm(formula = alc_use ~ famrel, data = alc[alc$sex == "M", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7458 -0.9118 -0.1898  0.8102  3.0882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.30184    0.38377   8.604 4.55e-15 ***
## famrel      -0.27800    0.09368  -2.968  0.00343 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.128 on 173 degrees of freedom
## Multiple R-squared:  0.04844,    Adjusted R-squared:  0.04294 
## F-statistic: 8.807 on 1 and 173 DF,  p-value: 0.003427
# make the plot tighter
op <- par(mfrow=c(2, 2),
          mar = c(4, 4, 2, 1),
          mgp = c(2, 1, 0))

plot(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))

par(op) # return original par

library(car)
# There is no heteroscedasticity by Breusch-Pagan test
car::ncvTest(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.398195, Df = 1, p = 0.52802

Next, plot the other variables as well

library(ggpubr)

g1 <- alc_var(alc, "goout")
g2 <- alc_var(alc, "sex")
g3 <- alc_var(alc, "studytime")
g4 <- alc_var(alc, "G3")

ggarrange(g1, g2, g3, g4,
          ncol = 4, nrow = 1, common.legend = TRUE, 
          legend = "bottom")

When goout increases, so does the alcohol use. Males appear to use more alcohol. As studytime increases, alcohol usage drops. G3 does not have a clear linear pattern with alcohol use. Otherwise, the above hypotheses are OK.

Logistic regression

Now we are ready to fit the logistic regression model with G3, famrel, goout, sex and studytime.

model.1 <- glm(paste0("high_use", " ~ ", paste(interesting_vars[-1], collapse=" + ")),
               family = "binomial", data = alc)

summary(model.1)
## 
## Call:
## glm(formula = paste0("high_use", " ~ ", paste(interesting_vars[-1], 
##     collapse = " + ")), family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6895  -0.7784  -0.4891   0.8155   2.6807  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.85026    0.85893  -0.990  0.32222    
## G3          -0.03667    0.04031  -0.910  0.36294    
## famrel      -0.41604    0.13989  -2.974  0.00294 ** 
## goout        0.77038    0.12402   6.212 5.25e-10 ***
## sexM         0.80171    0.26695   3.003  0.00267 ** 
## studytime   -0.46034    0.17229  -2.672  0.00754 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.86  on 364  degrees of freedom
## AIC: 381.86
## 
## Number of Fisher Scoring iterations: 4

We got a model with four significant variables. G3 is not significant, remove it and refit.

model.2 <- glm(high_use ~ famrel + goout + sex + studytime,
               family = "binomial", data = alc)

summary(model.2)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + sex + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733  0.08309 .  
## famrel       -0.4193     0.1399  -2.996  0.00273 ** 
## goout         0.7873     0.1230   6.399 1.57e-10 ***
## sexM          0.7959     0.2669   2.982  0.00286 ** 
## studytime    -0.4814     0.1711  -2.814  0.00490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4

There is a lot of output

  • Fitted model is \(log \frac{Pr(high\_use =1)}{ 1 - Pr(high\_use = 1)} = log \frac{Pr(high\_use =1)}{Pr(high\_use = 0)} = -1.2672 - 0.4193*famrel + 0.7873 * goout + 0.7959 * sexM - 0.4814 * studytime\)
  • The coefficients represent a change in the log odds of the response when the explanatory variable increases by one, conditional on the other explanatory variables remaining constant
  • All of the variables (and the non-zero intercept with 0.10 p-value cutoff) are significant by Wald test
  • Deviance is similar in concept to the squared error in ordinary regression and quantifies difference between the observed and predicted proportions/probabilities
  • Residual deviance is the sum of squares for residuals
  • Null deviance is the deviance for null model (high_use ~ 1)
  • AIC is the Akaike information criterion and can be used for model selection
  • Number of Fisher Scoring iterations is related to how many iterations are need to fit the model
# Residual deviance
sum(residuals(model.2, type = "deviance")^2)
## [1] 370.6854

See if residuals are roughly linear, otherwise we can’t use GLM.

# Plot residuals and check that linearity is OK
plot(model.2, which = 1)  # -> Looks OK.

We can also test with ANOVA if null model is significantly different from our model

# see http://homepage.stat.uiowa.edu/~rdecook/stat3200/notes/LogReg_part2_4pp.pdf
# and https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
# and https://stats.stackexchange.com/questions/59879/

# 1-pchisq(model.2$null.deviance-model.2$deviance, model.2$df.null-model.2$df.residual)
null <- glm(high_use ~ 1, family = "binomial", data = alc)
anova(null, model.2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ 1
## Model 2: high_use ~ famrel + goout + sex + studytime
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       369     452.04                          
## 2       365     370.69  4   81.354 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We conclude that at least one of the explanatory variables has non-zero coefficient by this test.

Compute odds ratios and confidence intervals

OR <- coef(model.2) %>% exp
CI <- confint(model.2) %>% exp

list(OR = OR, CI = CI)
## $OR
## (Intercept)      famrel       goout        sexM   studytime 
##   0.2816141   0.6575181   2.1974324   2.2165443   0.6179355 
## 
## $CI
##                  2.5 %    97.5 %
## (Intercept) 0.06545486 1.1622100
## famrel      0.49791884 0.8636137
## goout       1.73873662 2.8198312
## sexM        1.31841735 3.7630180
## studytime   0.43751933 0.8576353

The odds ratio for famrel is 0.66 and 95% CI is [0.50, 0.86]. Thus, increase by 1 unit in famrel is associated with a decrease in the odds of high_use by 14-50%.

Variables goout (increases high_use odds) and studytime (decreases high_use odds) are interpreted analogously and neither containts 1 in the 95% CI.

The sexM is interpreted in relation to implicit sexF: being male increases the high_use odds by around 122% with 95% CI of [1.32, 3.76].

It seems that the results correspond to the hypotheses. It feels like famrel would require more work although the interaction term of sex*famrel is not significant when added to model.2 (data not shown).

Predictive power

observed <- alc$high_use
predicted <- predict(model.2, type = "response") > 0.5

# 2x2 tabulation, confusion matrix
table(observed, predicted)
##         predicted
## observed FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52

Most cases are classified correctly.

Below are the predictions versus actual high_use values, plotted separately for each explanatory variable.

# Get predictions, compare against true values
predicted_abs <- predict(model.2, type = "response")

df <- data.frame(cbind(observed, predicted, predicted_abs, 
                       alc[, interesting_vars[3:6]]))

df$matches <- df$observed == df$predicted

# Plot by explanatory variables
prob.expl <- function(df, ind.var, expl.var, col.var) {

  g <- ggplot(df, aes(x = .data[[ind.var]], 
                      y = .data[[expl.var]])) + 
    geom_point(alpha = 0.5, aes(col = .data[[col.var]])) + 
    geom_smooth(method = "loess", col = "black", size = 0.5) +
    theme_bw()
  g
}
  
g1 <- prob.expl(df, "predicted_abs", "famrel", "matches")
g2 <- prob.expl(df, "predicted_abs", "goout", "matches")
g3 <- prob.expl(df, "predicted_abs", "sex", "matches")
g4 <- prob.expl(df, "predicted_abs", "studytime", "matches")

ggarrange(g1, g2, g3, g4, ncol = 2, nrow = 2, labels = c("A", "B", "C", "D"),
          common.legend = TRUE, legend = "bottom")

From the plot above we see where the predictions of high_useare right (greenish points) and wrong (red points). For famrel (A), sex (C) and studytime (D), the predictions seem to be false mostly around probability 0.5. With goout, high_use appears to be poorly predicted when goout is low and predicted value is in the range of [0.5, 0.7].

Quantify at which probability values mismatches tend to occur:

# Predictions equal observation
round(table(cut_interval(df[df$matches == TRUE, c("predicted_abs")], length=0.20)) /
  length(df$matches), 2)
## 
##   [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
##      0.36      0.22      0.08      0.08      0.02
# False predictions
round(table(cut_interval(df[df$matches == FALSE, c("predicted_abs")], length=0.20)) /
  length(df$matches), 2)
## 
##   [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] 
##      0.06      0.05      0.10      0.03

This confirms that most false predictions are made around the probability 0.5.

The training error of the model is 0.24, i.e. lowish. If we just guessed using, say, goout, the error would be around 0.28, which is not very bad. Of course the guessing is not very random, since we already know that goout is an important predictor.

# Training error, model.2
sum(df$matches == FALSE) / length(df$matches)
## [1] 0.2378378
# Guess high_use by goout
guesses <- rep(FALSE, times = nrow(df))
guesses[df$goout > mean(df$goout)] <- TRUE

table(df$observed, guesses)
##        guesses
##         FALSE TRUE
##   FALSE   198   61
##   TRUE     41   70
sum(guesses != df$observed) / nrow(df)
## [1] 0.2756757

Cross-validation

Follow the code in DataCamp for this one

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, df$predicted_abs)
## [1] 0.2378378
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model.2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486486

The test set performance is a bit better than the DataCamp baseline, 0.25 versus 0.26.

Simple model selection

# Select all explanatory variables available
training.vars <- colnames(alc)[1:36] # Discard non-joined .m and .p variables

# Discard alcohol consumption variables and technical variables
training.vars <- training.vars[!(
  training.vars %in% c("Dalc", "Walc", "n", "id.p", "id.m"))]

Now, with enough compute we could just try all the variables, but that would amount to 2^32 = 4294967296 models. We instead use stepwise selection by AIC starting with the full model and progressing backwards. We also assume R handles the factor variable to dummy (treatment) encoding. (Note the problems of stepwise regression. It would be advisable to use other methods e.g. through Caret.)

library(MASS)

# Run stepwise AIC selection on full model, keep the model and its AIC
final.m <- stepAIC(
  # Full model
  glm(paste0("high_use", " ~ ", paste(training.vars[-1], collapse=" + ")),
             family = binomial, data = alc),
  direction = "backward", trace = FALSE, 
  # What to keep from models
  keep = function(model, aic) { list(model = model, aic = aic) } )

Then perform 10-fold CV for the models

# Init
CVs  <- rep(NULL, ncol(final.m$keep))  # Test error
ERRs <- rep(NULL, ncol(final.m$keep))  # Train error
AICs <- rep(NULL, ncol(final.m$keep))
Nvar <- rep(NULL, ncol(final.m$keep))

for (i in 1:ncol(final.m$keep)) {  # Each column is a model
  
  interim.m <- final.m$keep[, i][1]$model
  CVs[i] <- cv.glm(data = alc, cost = loss_func, glmfit = interim.m, K = 10)$delta[1]
  
  ERRs[i] <- loss_func(alc$high_use, predict(interim.m, type = "response"))
  
  AICs[i] <- final.m$keep[, i][2]$aic
  
  Nvar[i] <- length(final.m$keep[, i][1]$model$coefficients) - 1
}

CV.res <- data.frame(test = CVs, train = ERRs, AIC = AICs, Nvar = Nvar)
# Somewhat unsatisfactory graph with points

g1 <- ggplot(CV.res, aes(x = Nvar, y = test)) + geom_line() + 
  geom_point(aes(x = Nvar, y = test, color = AIC), size = 5, alpha = 0.8) +
  theme_bw() +  xlab("Number of variables") + ylab("Test set error") +
  ggtitle("10-fold CV on backward stepwise variable selection with AIC") +
  scale_color_gradient(low="blue", high="red")

g1

# Better graph with both test and training error, and AIC separately

errors <- gather(CV.res, type, error, c("test", "train"))

g1 <- ggplot(errors, aes(x = Nvar, color = type)) + geom_line(aes(y = error)) + 
  theme_bw() +  xlab("Number of variables") + ylab("Test set error")

g2 <- ggplot(errors, aes(x = Nvar)) + geom_line(aes(y = AIC)) + 
  theme_bw() +  xlab("Number of variables") + ylab("AIC")
  
g <- ggarrange(g1, g2, ncol = 2, nrow = 1, widths = c(2, 1.5),
          common.legend = FALSE, legend = "left")

annotate_figure(g, top = text_grob("10-fold CV on backward stepwise variable selection with AIC"))

The final model was

summary(final.m)
## 
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian + 
##     activities + romantic + famrel + goout + health + absences, 
##     family = binomial, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8167  -0.7154  -0.4262   0.5635   2.7612  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.14937    0.86318  -2.490 0.012772 *  
## sexM              0.98275    0.28722   3.422 0.000623 ***
## addressU         -0.89491    0.33661  -2.659 0.007847 ** 
## famsizeLE3        0.47009    0.29971   1.569 0.116759    
## reasonhome        0.31183    0.35612   0.876 0.381229    
## reasonother       1.24517    0.47895   2.600 0.009328 ** 
## reasonreputation -0.28654    0.36969  -0.775 0.438278    
## guardianmother   -0.68303    0.32428  -2.106 0.035179 *  
## guardianother     0.41726    0.71637   0.582 0.560256    
## activitiesyes    -0.51481    0.28383  -1.814 0.069710 .  
## romanticyes      -0.50684    0.30464  -1.664 0.096171 .  
## famrel           -0.50627    0.15580  -3.249 0.001156 ** 
## goout             0.91148    0.13829   6.591 4.36e-11 ***
## health            0.16738    0.10359   1.616 0.106138    
## absences          0.09152    0.02373   3.856 0.000115 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 335.02  on 355  degrees of freedom
## AIC: 365.02
## 
## Number of Fisher Scoring iterations: 5

Removing all non-significant explanatory variables and iterating yields a model with 0.20 prediction error and 0.22 test set error in 10-fold CV:

parsimonious.m <- glm(formula = high_use ~ sex + address + reason + guardian + 
    activities + famrel + goout + absences, family = binomial, data = alc)

summary(parsimonious.m)
## 
## Call:
## glm(formula = high_use ~ sex + address + reason + guardian + 
##     activities + famrel + goout + absences, family = binomial, 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8386  -0.7558  -0.4676   0.6257   2.8133  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.65723    0.77828  -2.129 0.033225 *  
## sexM              1.09542    0.27999   3.912 9.14e-05 ***
## addressU         -0.84765    0.33097  -2.561 0.010435 *  
## reasonhome        0.24841    0.34895   0.712 0.476533    
## reasonother       1.06110    0.46814   2.267 0.023412 *  
## reasonreputation -0.35044    0.36541  -0.959 0.337531    
## guardianmother   -0.65433    0.31871  -2.053 0.040071 *  
## guardianother     0.27975    0.69303   0.404 0.686459    
## activitiesyes    -0.55234    0.28067  -1.968 0.049075 *  
## famrel           -0.45739    0.15011  -3.047 0.002311 ** 
## goout             0.87831    0.13550   6.482 9.06e-11 ***
## absences          0.08739    0.02335   3.743 0.000182 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 341.90  on 358  degrees of freedom
## AIC: 365.9
## 
## Number of Fisher Scoring iterations: 5
loss_func(alc$high_use, predict(parsimonious.m, type = "response"))
## [1] 0.1972973
cv.glm(data = alc, cost = loss_func, glmfit = parsimonious.m, K = 10)$delta[1]
## [1] 0.2216216

(more chapters to be added similarly as we proceed with the course!)